Understanding and predicting customer satisfaction levels are crucial
for any company, particularly in the airline industry. High satisfaction
levels often lead to repeat business, positive word-of-mouth, and a
competitive edge in the market. Conversely, dissatisfaction can result
in negative reviews, loss of customers, and decreased revenue. By
accurately predicting passenger satisfaction, airlines can tailor their
services to meet customer needs better, improve overall experience, and
implement strategic changes that foster customer loyalty and enhance
operational efficiency.
In the following report, I will work with a variety of indicators to build multiple machine learning models that predict airline passenger satisfaction. Key indicators will focus on customer characteristics and various amenities and services provided by the airline. The model with the highest area under the ROC curve (ROC AUC) will be selected as the final model to predict the satisfaction level of airline customers.
The roadmap for this project is structured into several key phases. First, we will focus on cleaning the dataset, handling missing values, and encoding categorical variables to prepare the data for analysis. Following this, we perform exploratory data analysis (EDA) to understand the underlying patterns and relationships within the data. The next phase involves splitting the data into training and testing sets to facilitate model development and evaluation. We then build and train our models, tune with cross-validation if needed and evaluate the best performing model. Whichever model performs best will be fit to our testing data.
Before we go straight into visualizing our data with plots and tables, we need to make sure our data is tidy. Not all data we load in will be clean and ready for analyzing. In this section we will look for missing values, convert categorical variables into factors and remove columns we don’t need.
The data for this project is taken from the Kaggle data set, “Airline
Passenger Satisfaction.” It was scraped from the Passenger
Satisfaction Database and cleaned up by TJ Klein. The data
originated from a survey conducted to assess passenger satisfaction with
US airlines.
Klein, T. (2020). Airline Passenger Satisfaction, Version 1.
Retrieved April 12, 2024 from https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data.
I will first load the data in and clean up the column names.
# Load in data
airline <- read.csv("data/airline.csv")
# Clean predictor names
airline <- airline %>% clean_names()
Let’s take a look at how much data we have and the datatype for each variable.
# check how many rows and columns
airline %>% dim()
## [1] 103904 25
# check internal structure
airline %>% str()
## 'data.frame': 103904 obs. of 25 variables:
## $ x : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ gender : chr "Male" "Male" "Female" "Female" ...
## $ customer_type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ type_of_travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ flight_distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ inflight_wifi_service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ departure_arrival_time_convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ ease_of_online_booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ gate_location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ food_and_drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ online_boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ seat_comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ inflight_entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ on_board_service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ leg_room_service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ baggage_handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ checkin_service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ inflight_service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ departure_delay_in_minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ arrival_delay_in_minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
In this data set, there are 103,904 rows and 25 columns. In other
words, we have 103,904 responses from customers and 25 variables. The
variable we are trying to predict is satisfaction, which
leaves us with 24 predictor variables. Let’s try to narrow down the
number of predictors.
The two variables I will exclude are # and
id because these variables help to identify the individual
who has taken the survey. These variables do not contribute to customer
satisfaction. Also, notice how arrival_delay_in_minutes is
of num datatype. Let’s change that to an integer.
# remove column x and id
airline <- airline %>% select(-x, -id)
# change datatype of variable
airline$arrival_delay_in_minutes = as.integer(airline$arrival_delay_in_minutes)
Finally, let’s change our categorical variables into factors.
airline$gender <- as.factor(airline$gender)
airline$customer_type <- as.factor(airline$customer_type)
airline$type_of_travel <- as.factor(airline$type_of_travel)
airline$class <- as.factor(airline$class)
airline$satisfaction <- as.factor(airline$satisfaction)
Now that we have dropped some columns and changed the data types, let’s see if there are missing values in our data set.
# check for percentage of missing values
vis_miss(airline, warn_large_data = FALSE)
# summary statistic
summary(airline$arrival_delay_in_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 15.18 13.00 1584.00 310
The data set has less than 0.1% missing data in the
arrival_delay_in_minutes column. This might seem like a
very small percentage, but in absolute terms, it translates to 310
missing values. I will replace the NA’s with its median=0 to normalize
it.
# replace NA with 0
airline$arrival_delay_in_minutes[is.na(airline$arrival_delay_in_minutes)] <-0
Our data now includes the following 23 variables:
gender Gender of the passengers (Female, Male)
customer_type The customer type (Loyal customer, disloyal
customer)
age The actual age of the passengers
type_of_travel Purpose of the flight of the passengers
(Personal Travel, Business Travel)
class Travel class in the plane of the passengers
(Business, Eco, Eco Plus)
flight_distance The flight distance of this journey
inflight_wifi_service Satisfaction level of the inflight
wifi service (0:Not Applicable;1-5)
departure_arrival_time_convenience Satisfaction level of
Departure/Arrival time convenient
ease_of_online_booking Satisfaction level of online
booking
gate_location Satisfaction level of Gate location
food_and_drink Satisfaction level of Food and drink
online_boardingSatisfaction level of online boarding
seat_comfort Satisfaction level of Seat comfort
inflight_entertainment Satisfaction level of inflight
entertainment
on_board_service Satisfaction level of On-board
service
leg_room_service Satisfaction level of Leg room
service
baggage_handling Satisfaction level of baggage
handling
checkin_service Satisfaction level of Check-in
service
inflight_service Satisfaction level of inflight
service
cleanliness Satisfaction level of Cleanliness
departure_delay_in_minutes Minutes delayed when
departure
arrival_delay_in_minutes Minutes delayed when Arrival
satisfaction Airline satisfaction level(Satisfaction,
neutral or dissatisfaction)
Now let’s start visualizing the data to find relationships between variables.
First, let’s explore the distribution of the response variable
satisfaction to make sure that our data is balanced.
# pie chart
airline %>%
count(satisfaction) %>%
mutate(percent = prop.table(n) * 100) %>%
ggplot(aes(x = "", y = n, fill = satisfaction)) +
geom_bar(width = 1, stat = "identity", color = "white") +
geom_text(aes(label = paste0(round(percent), "%")), position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
labs(title = "Distribution of Satisfaction Levels") +
theme_void()
The satisfaction variable contains two possible values:
‘neutral or dissatisfied’ and ‘satisfied.’ From the pie chart, we can
see that the data is quite balanced and thus, does not need any
adjustments or resampling.
Next, I will use a correlation plot to explore the relationships between the numeric variables.
# correlation matrix (numerical)
airline %>%
select_if(is.numeric) %>%
cor(use = "complete.obs") %>%
corrplot(type = 'lower', diag = FALSE, method = 'shade', tl.cex = 0.7)
The correlation matrix does not display any red colors, meaning that the
variables are positively or neutrally correlated with each other.
Notably, departure_delay_in_minutes is directly correlated
with arrival_delay_in_minutes. This makes sense because if
a flight departs late, it will also arrive late to its
destination.
Additionally, the different amenities and services provided by the
airline have a positive correlation with each other. For example,
cleanliness is highly positively with
food_and_drink, seat_comfort and
inflight_entertainment. ease_of_online_booking
is correlated with inflight_wifi_service.
food_and_drink is positively correlated with
seat_comfort and inflight_entertainment.
Understanding these groupings and correlations helps us identify the key
areas that influence passenger satisfaction and provides insights for
improving the overall customer experience.
airline %>%
pivot_longer(
cols = c(
inflight_wifi_service, departure_arrival_time_convenient, ease_of_online_booking,
gate_location, food_and_drink, online_boarding, seat_comfort, inflight_entertainment,
on_board_service, leg_room_service, baggage_handling, checkin_service, inflight_service,
cleanliness
),
names_to = "Amenity",
values_to = "Satisfaction"
) %>%
ggplot(aes(x = Amenity, y = Satisfaction)) +
geom_boxplot(fill = "#00BFC4", color = "blue") +
labs(
title = "Comparison of Satisfaction Levels for Amenities and Services",
x = "Amenity and Services",
y = "Satisfaction Level"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot illustrates the distribution of passenger satisfaction
levels for various airline services and amenities. Most services show a
median satisfaction level around 3, indicating moderate satisfaction.
Notable exceptions include inflight_service,
inflight_entertainment, leg_room_service and
on_board_service, all having higher medians around 4 .
baggage_handling also stands out with relatively high
satisfaction, evidenced by its median around 4 and a higher
interquartile range. Outliers in check-in service indicate a few very
low satisfaction ratings, while most other categories have relatively
fewer outliers, indicating more consistent satisfaction levels among
passengers.
# percent stacked bar chart (gender)
airline %>%
ggplot(aes(x=satisfaction, fill=gender)) +
geom_bar(position="fill") +
ggtitle("Satisfaction Proportion by Gender")
It’s evident that both male and female customers exhibit similar
proportions of neutral/dissatisfied and satisfied satisfaction levels.
This suggests that gender might not play a significant role in
influencing customer satisfaction levels within this dataset.
# percent stacked bar chart (type of travel)
airline %>%
ggplot(aes(x=satisfaction, fill=type_of_travel)) +
geom_bar(position="fill") +
ggtitle("Satisfaction Proportion by Travel Type")
Business traveling customers and personal traveling customers exhibit
similar proportions of neutral/dissatisfied. Meanwhile, the disparity in
satisfaction levels between business and personal travelers is striking,
with over 90% of satisfied customers falling into the former
category.
# percent stacked bar chart (class)
airline %>%
ggplot(aes(x=satisfaction, fill=class)) +
geom_bar(position="fill") +
ggtitle("Satisfaction Proportion by Travel Class")
In the satisfied category, a notable 75% of passengers are in business
class, indicating a higher level of satisfaction among these travelers.
Conversely, in the neutral/dissatisfied category, 50% of passengers are
in economy class. This trend clearly demonstrates that booking a higher
class correlates with a better travel experience.
Now that we know which variables will affect the airline satisfaction level, we can begin to build our models. First we will split our data into training and testing sets, create our recipe and implement a k-fold cross validation within our models.
Before we set up our models, we need to split the data into training
and testing sets. The training set is used to train the model, while the
testing set is used to evaluate the model performance. I have decided
split 80% of the data to the training set and the remaining 20% to the
testing set. This split ensures that our model is being trained on most
of the data we have, whilst also having a portion set aside for testing.
When the data is split, it will be stratified on the
satisfaction variable so that the satisfaction
variable is equally distributed in both the training and testing data.
We will also set a seed so that we can reproduce our results.
# set seed
set.seed(14789)
# split our data into training and testing
airline_split <- initial_split(airline, prop = 0.80, strata = satisfaction)
airline_train <- training(airline_split)
airline_test <- testing(airline_split)
nrow(airline_train)/nrow(airline)
## [1] 0.7999981
nrow(airline_test)/nrow(airline)
## [1] 0.2000019
To confirm that the data was split correctly, we divide the number of observations in the training/testing sets by the number of observations in the original set. The training data has approximately 80% of the data and the testing set has 20%. This means our data was split correctly.
Next, we will create a recipe for our models to use. We can use the
same recipe for each of our models because we will be using the same
predictors and response variable. We will be using 21 out of the 22
predictors we have. gender will not be included in our
recipe because earlier we found that there is no difference in
satisfaction level between males and females. Next, we will dummy code
all the categorical predictors. Finally, we must make sure to center and
scale all the predictors.
airline_recipe <- recipe(satisfaction ~ customer_type + age + type_of_travel + class + flight_distance + inflight_wifi_service + departure_arrival_time_convenient + ease_of_online_booking + gate_location + food_and_drink + online_boarding + seat_comfort + inflight_entertainment + on_board_service + leg_room_service + baggage_handling + checkin_service + inflight_service + cleanliness + departure_delay_in_minutes + arrival_delay_in_minutes, data = airline) %>%
step_dummy(all_nominal_predictors()) %>% # dummy-code all categorical predictors
step_center(all_predictors()) %>% # standardizing predictors
step_scale(all_predictors())
As our final step, we will implement k-fold cross-validation on the
training data set. k-fold cross-validation is a resampling method used
in machine learning to evaluate the performance of a model. It involves
splitting the data set into k folds. One of the folds is used as a
validation set while the remaining folds is used for training the model.
The process is repeated k times, each time using a different fold as the
validation set. Then, the results from the k iterations are averaged to
produce an estimate of the model’s performance.
We will use k=10 and stratify on our response variable
satisfaction.
airline_fold <- vfold_cv(airline_train, v = 10, strata = satisfaction)
We can finally start building our models. The process for building
and fitting the models is very similar across all the models. The
general steps are:
Specify the model and model parameters, the engine to use and the
model type (in this case classification)
Set up a workflow and add the model and recipe to the
workflow
If there are models that can be tuned, set up a tune grid by
specifying the range of values and number of levels for each
parameter
Tune the model with the workflow, along with the k-folds and
tuning grid
Select the most accurate model from the tuning grid, and then
finalize the workflow using those selected tuning parameters.
Fit that model with our workflow to the training data set
Save the results to a RDA file so we don’t have to rerun the
code
I will use roc_auc to evaluate the performance of the models. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The AUC (area under the curve) tells how much the model is capable of distinguishing between the classes. A higher roc_auc is better. The range of roc_auc is 0 to 1, with 1 indicating perfect classification.
I fit four different models: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis and Random Forest. The first three models did not take long to run, but the Random Forest model required more time and computing power. To save time, each model was run in a separate R file and then saved in a RDA file. I will attach these files in a GitHub repository if you want to refer to it. Here, I will load in the RDA files so we can analyze the performance of each model.
load("RDA/airline_Linear_Discriminant_Analysis.rda")
load("RDA/airline_Logistic_Regression.rda")
load("RDA/airline_Quadratic_Discriminant.rda")
load("RDA/airline_Random_Forest.rda")
The autoplot() function is a useful tool that can help us visualize the results of the models we have tuned.
In our Random Forest model, we tuned 3 parameters:
mtry: number of predictors that will be randomly sampled at
each split during the creation of the model
trees: number of trees contained in each forest
min_n: minimum number of observations needed in a leaf node
to create another split
As the number of predictors increased, the accuracy also increased. Similarly, as the number of trees increased, the ROC AUC also increased. The optimal minimal node size seems to be 50. The bottom middle plot is the most effective model, with 10 randomly selected predictors, minimal node size of 50 and 10 trees.
autoplot(airline_rf_tune)
I will create a tibble to display the final roc_auc value for each fitted model. This will help determine the best model to be used to fit the testing data.
airline_lda_auc <- augment(airline_lda_fit, new_data = airline_train) %>%
clean_names() %>%
roc_auc(satisfaction, pred_neutral_or_dissatisfied) %>%
select(.estimate)
airline_log_auc <- augment(airline_log_fit, new_data = airline_train) %>%
clean_names() %>%
roc_auc(satisfaction, pred_neutral_or_dissatisfied) %>%
select(.estimate)
airline_qda_auc <- augment(airline_qda_fit, new_data = airline_train) %>%
clean_names() %>%
roc_auc(satisfaction, pred_neutral_or_dissatisfied) %>%
select(.estimate)
airline_rf_auc <- augment(airline_rf_fit_model, new_data = airline_train) %>%
clean_names() %>%
roc_auc(satisfaction, pred_neutral_or_dissatisfied) %>%
select(.estimate)
airline_roc_aucs <- c(airline_lda_auc$.estimate,
airline_log_auc$.estimate,
airline_qda_auc$.estimate,
airline_rf_auc$.estimate)
airline_mod_names <- c("LDA",
"Logistic Regression",
"QDA",
"Random Forest")
airline_results <- tibble(Model = airline_mod_names,
ROC_AUC = airline_roc_aucs)
airline_results <- airline_results %>%
arrange(-airline_roc_aucs)
airline_results
Our Random Forest model has the highest roc_auc value at 0.997. The other models performed well too, having a roc_auc value of over 0.9. Moving forward, we will be using the Random Forest model to predict the testing set.
Now that we have determined the Random Forest model had the best overall performance, we want to test it on data that it hasn’t seen yet. Since the model was orignally trained on our training data, the ROC AUC score was very high. Within our Random Forest model, which model performed the best?
show_best(airline_rf_tune, metric = "roc_auc") %>%
select(-.estimator, .config) %>%
slice(1)
The Random Forest Model #125 performed the best out of all the random forest models. We will finally fit our best model to our testing set to see how well it predicts the satisfaction level of passengers.
We will now create predictions on our testing data set. I will also add the columns from the testing data set to compare the results.
airline_predict <- predict(airline_rf_fit_model, new_data = airline_test, type = "class")
predict_actual <- airline_predict %>%
bind_cols(airline_test)
predict_actual
At a glance, our model predicted the satisfaction
variable pretty well. To fully test the accuracy of the model, we will
find model #125’s true ROC AUC performance results on our testing
set.
airline_roc_auc <- augment(airline_rf_fit_model, new_data = airline_test) %>%
clean_names()%>%
roc_auc(satisfaction, pred_neutral_or_dissatisfied) %>%
select(.estimate)
airline_roc_auc
The ROC AUC value of model #125 is 0.9938937, which is quite high.
Let’s take a look at the ROC curve. We want the curve to be as high and to the left as possible, which is the case here. This means that our model performed very well, as can be reflected by the ROC AUC result from above.
airline_roc_curve <- augment(airline_rf_fit_model, new_data = airline_test) %>%
clean_names() %>%
roc_curve(satisfaction, pred_neutral_or_dissatisfied)
autoplot(airline_roc_curve)
After fitting multiple models, the best model to predict the airline
passenger satisfaction was the random forest model. This may be due to
its ability to handle complex interactions between variables, its
robustness against overfitting or its nonparametric algorithm. The
construction of multiple decision trees and the output of the mode of
classes allows the model to capture a wide variety of patterns in the
data.
The model that performed the worst is the Quadratic Discriminant
Analysis model. QDA requires estimating a separate covariance matrix for
each class, which can be problematic with high-dimensional data.
Furthermore, it can also be due to its sensitivity to the assumption of
normally distributed features within each class. However, even though
the QDA model performed the worst out of the four models, it still had a
very high accuracy.
If my laptop had a high computing power, I would try to implement
other models such as the Gradient Boosted Tree, which builds trees
sequentially, with each tree attempting to correct the errors of the
previous one. I would also consider the Support Vector Machine, which
can be particularly effective for classification tasks with complex,
non-linear decision boundaries when combined with the appropriate
kernel.
Furthermore, I would also like to explore if predictors such as time
will affect passenger satisfaction. Examining whether satisfaction
scores tend to drop during holiday seasons due to higher passenger
volumes can guide targeted service improvements during these
periods.
Overall, this was a very fun project that helped build my machine learning skills. As I immersed myself into this project, I found that I enjoyed learning how to make my models more accurate.